This report is for ETC5521 Assignments 1 & 2 by Team bilby, comprising (Yuheng Cui), (Jimmy Effendy), Weihao Li, and Yan Ma.

1 Introduction and Motivation

Families and friends often spend their holidays and weekends in amusement parks. Amusement parks have grown in popularity in recent years: worldwide attendance at the top 10 amusement park groups passed the half-billion mark for the first time last year (Schneider, 2019), with 4% year-on-year growth (Index & Index/AECOM, 2019).

With this growing popularity, the safety of amusement rides is a matter of substantial public interest (Woodcock, 2014). The annual number of ride-related injuries in North America was estimated at 1,289 in 2018, 26% higher than in 2017 (Amusement Parks & Attractions, 2018). The International Association of Amusement Parks and Attractions further stated that approximately 11% of those injuries were “serious”, meaning that they resulted in immediate admission and hospitalization for more than 24 hours for purposes other than medical observation, or in death.

While accidents in amusement parks are arguably infrequent, they have prominent effects when they happen. The International Association of Amusement Parks and Attractions (2017) reported that, after a ride malfunction in Australia killed four people, the park and other venues in Australia suffered considerable declines in attendance. This shows that public confidence, safety, and commercial feasibility are strongly interconnected (Woodcock, 2014).

Thorough evaluation of injury records related to amusement rides is in the public interest and is integral to encouraging constant improvement in the industry. This paper aims to find the factors that have a major influence on amusement ride accidents. Uncovering them may encourage amusement park owners to direct their resources towards the groups or equipment most at risk. In addition, this report may assist regulatory bodies when they develop safety standards and regulations.

We believe amusement parks must make an effort to reduce incidents by maintaining equipment, training staff, and monitoring visitors. At the same time, visitors should follow park regulations and equipment instructions; after all, visitors bear the first responsibility for their own safety. Reducing injuries in amusement parks improves the well-being of society.

The report first describes the data used and how it was prepared for analysis. It then presents and discusses the analysis and findings. It closes with conclusions on the major findings and a discussion of future studies.

2 Data Description

The datasets were downloaded from the GitHub repository of Tidy Tuesday (2020), a weekly social data project in R. We use the datasets featured in its activity of September 10, 2019.

Two datasets are provided in the repository: one originating from data.world and another from the Saferparks database. An additional dataset from the Texas Department of Insurance (TDI) on currently available insurance policies is also used in this report (TDI, 2020a).

2.1 Injury records in Texas amusement parks (tx_injuries.csv)

The Texas amusement park injury dataset, obtained from data.world, was collected by the Texas Department of Insurance (TDI) (Millerbernd, 2018). It records injuries caused by amusement rides in the State of Texas from February 1, 2013 to February 1, 2017 (Millerbernd, 2018). An amusement ride is “any mechanical, gravity, or water device or devices that carry or convey passengers along, around, or over a fixed or restricted route or course or within a defined area for the purpose of giving its passengers amusement, pleasure, or excitement” (Insurance, 2019). According to TDI (2019), amusement ride owners and operators are required to submit a quarterly injury report. TDI further stated that this record covers any injury that requires medical treatment or results in death.

This dataset has 542 observations and 13 variables. The name, type and description of each variable in tx_injuries.csv can be found in the data dictionary below.

variable            class      description
injury_report_rec   double     Unique Record ID
name_of_operation   character  Company name
city                character  City
st                  character  State (all TX)
injury_date         character  Injury date - note there are some different formats
ride_name           character  Ride Name
serial_no           character  Serial number of ride
gender              character  Gender of the injured individual
age                 character  Age of the injured individual
body_part           character  Body part injured
alleged_injury      character  Alleged injury - type of injury
cause_of_injury     character  Approximate cause of the injury (free text)
other               character  Anecdotal information in addition to cause of injury

The data quality check in Fig 2.1 shows that the variables other and serial_no have a high percentage of missing values. Because these variables have little bearing on our analysis, they can be removed from the dataset. In addition, the variables age and injury_date are stored as character; converting them to integer and date respectively resolves this.
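The two fixes described above, measuring how much of a column is missing and coercing age to an integer, can be sketched as follows. This is a minimal Python illustration, not the report's own R code; the sample rows, the set of missing-value markers, and the helper names are assumptions made for the example.

```python
from typing import Optional

# Hypothetical sample rows; the real tx_injuries.csv has 13 columns.
rows = [
    {"age": "12", "serial_no": None, "other": None},
    {"age": "n/a", "serial_no": "A-1", "other": None},
    {"age": "37", "serial_no": None, "other": ""},
    {"age": "8", "serial_no": None, "other": None},
]

# Assumed markers for a missing value.
MISSING = {None, "", "n/a", "NA"}


def missing_share(records, column):
    """Return the fraction of records whose value in `column` is missing."""
    missing = sum(1 for r in records if r.get(column) in MISSING)
    return missing / len(records)


def to_int_or_none(value) -> Optional[int]:
    """Convert a character age to int; unparseable values become None."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None


ages = [to_int_or_none(r["age"]) for r in rows]
print(ages)
print(missing_share(rows, "serial_no"), missing_share(rows, "other"))
```

Columns whose missing share is close to 1 and that are not needed for the analysis (here serial_no and other) are candidates for removal.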

Figure 2.1: A visualization of the data quality check on injury records in Texas amusement parks. The plot is an “at-a-glance” plot generated using the visdat package. The y axis represents each observation, and the x axis represents each variable. In this dataset, most of the variables are stored as “character”, with few missing values. Serious data quality issues can be observed in the variables serial_no and other, but they are irrelevant to this analysis.

2.2 Accidents records (safer_parks.csv)

The accident records in the Saferparks dataset, on the other hand, originate from the U.S. state and federal safety agencies that regulate amusement rides (Saferparks, 2020). Saferparks obtained them by submitting public records requests to those agencies. In some cases, additional requests were submitted to specific agencies to achieve particular goals (Saferparks, 2020). As a result, Saferparks needs to harmonize these datasets in its database.

The dataset has 8351 observations and 23 variables. The name, type and description of each variable in safer_parks.csv can be found in the data dictionary below.

variable              class      description
acc_id                double     Unique ID
acc_date              character  Accident Date
acc_state             character  Accident State
acc_city              character  Accident City
fix_port              character  .
source                character  Source of injury report
bus_type              character  Business type
industry_sector       character  Industry sector
device_category       character  Device category
device_type           character  Device type
tradename_or_generic  character  Common name of the device
manufacturer          character  Manufacturer of device
num_injured           double     Num injured
age_youngest          double     Youngest individual injured
gender                character  Gender of individual injured
acc_desc              character  Description of accident
injury_desc           character  Injury description
report                character  Report URL
category              character  Category of accident
mechanical            double     Mechanical failure (binary NA/1)
op_error              double     Operator error (binary NA/1)
employee              double     Employee error (binary NA/1)
notes                 character  Additional notes

The data quality check in Fig 2.2 shows that 4 variables have too many missing values to be analysed, so we ignore them. The variable acc_date should be converted to a date. Although manufacturer also has many missing values, we decided to keep it because it may be insightful.

Figure 2.2: A data quality check on the Saferparks accidents dataset. The y axis represents each observation, and the x axis represents each variable. The plot is an “at-a-glance” plot. This dataset has serious missing-value issues in the variables manufacturer, report, notes, mechanical, op_error and employee.

2.3 Insurance policies (tx_policies.csv)

The insurance policies dataset originates from the Texas Department of Insurance. It lists the current amusement ride insurance policies in Texas. It has 683 observations and 5 variables. The name, type and description of each variable in tx_policies.csv can be found in the data dictionary below.

variable           class      description
Record             integer    Unique ID
Name of Operation  character  Name of operation
Expiration Date    character  The expiration date of the insurance policy
Agent              character  The name of the sales agent
Carrier            character  The name of the carrier

This dataset has very few missing values: around 99.9% of cases are complete.

2.4 Data Limitation

Because the documentation TDI provides for its accident reports is limited, it is difficult to determine the limitations of the dataset. Some variables have a considerable number of missing values, and TDI does not provide a data dictionary, so the meaning of some variables had to be inferred. Lastly, the injury_date variable is formatted inconsistently.

According to Saferparks (n.d.), reporting criteria, level of detail, types of equipment included, and years covered vary widely across years, industry sectors, jurisdictions, and other factors. Saferparks further stated that states that are transparent, vigilantly monitor safety incidents, and implement efficient data management systems will log higher numbers of accidents. In other words, a high number of injuries may be an indication of being more attentive to safety, not less (Saferparks, n.d.).

While the dataset can be used to uncover how patrons got hurt on amusement rides, Saferparks does not recommend using it for comparisons across states, parks, rides or years (Saferparks, n.d.). One reason is that state laws on reporting amusement ride injuries vary widely. For instance, it is mandatory to report go-kart accidents in Florida but not in California.

Due to these limitations, this report does not use the Saferparks dataset to analyze nationwide patterns. Instead, it focuses largely on amusement park accidents that occurred in Texas, United States of America.

2.5 Data Cleaning and Transformation

A considerable amount of data wrangling was needed on the TDI injury dataset prior to the analysis. Firstly, the date variable was formatted inconsistently: some observations were stored as Excel serial numbers (e.g. 39448). Dates in this format were converted to “YYYY-MM-DD” format.
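The serial-number conversion can be sketched as follows. This is a minimal Python illustration, not the report's own R code. It assumes Excel's 1900 date system, in which modern serial numbers count days from 1899-12-30, and it assumes, purely for illustration, that the remaining values use a US-style "m/d/YYYY" format; the function name is hypothetical.

```python
from datetime import date, datetime, timedelta

# In Excel's 1900 date system, modern serial numbers count days from here.
EXCEL_EPOCH = date(1899, 12, 30)


def parse_injury_date(raw: str) -> date:
    """Parse a raw injury_date value into a date.

    Handles Excel serial numbers (e.g. "39448") and, as an assumed
    second format, US-style "m/d/YYYY" strings.
    """
    text = raw.strip()
    if text.isdigit():
        return EXCEL_EPOCH + timedelta(days=int(text))
    return datetime.strptime(text, "%m/%d/%Y").date()


print(parse_injury_date("39448").isoformat())      # 2008-01-01
print(parse_injury_date("6/14/2015").isoformat())  # 2015-06-14
```

Calling isoformat() on the result yields the “YYYY-MM-DD” representation mentioned above.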

Secondly, the following variables were added to the TDI injury dataset:

  • injury_year: the year when the observed injury occurred
  • injury_month: the month when the observed injury occurred
  • injury_day: the day when the observed injury occurred
  • season: the season (U.S.) when the observed injury occurred
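The derived variables listed above can be sketched as follows. This is a Python illustration rather than the report's R code. The report does not say how a month is mapped to a season, so the mapping below (meteorological seasons, with December to February as winter) is an assumption, as is the function name.

```python
from datetime import date

# Assumed mapping: meteorological seasons for the Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}


def add_date_parts(injury_date: date) -> dict:
    """Derive the year, month, day and season columns from one date."""
    return {
        "injury_year": injury_date.year,
        "injury_month": injury_date.month,
        "injury_day": injury_date.day,
        "season": SEASON_BY_MONTH[injury_date.month],
    }


print(add_date_parts(date(2015, 6, 14)))
```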

The TDI dataset on current insurance policies also needed cleaning. Firstly, the janitor (Firke, 2020) package was used to tidy the column names. Next, the agent variable was wrangled because many observations were misspelled.
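The two cleaning steps can be sketched in Python as follows. This is not the janitor package's algorithm and not the report's own code; it only illustrates the idea of turning column names into snake case and mapping misspelled agent names to one spelling. The correction table and the agent name in it are hypothetical.

```python
import re


def to_snake_case(name: str) -> str:
    """Lower-case a column name and join its words with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    return "_".join(word.lower() for word in words)


def normalize_agent(name: str, corrections: dict) -> str:
    """Collapse whitespace and case, then apply hand-made spelling fixes."""
    key = " ".join(name.split()).lower()
    return corrections.get(key, key)


# Column names from the tx_policies.csv data dictionary.
columns = ["Record", "Name of Operation", "Expiration Date", "Agent", "Carrier"]
clean_columns = [to_snake_case(c) for c in columns]
print(clean_columns)

# Hypothetical misspelling and its correction.
corrections = {"example insurence agency": "example insurance agency"}
print(normalize_agent("  Example   Insurence Agency ", corrections))
```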

Finally, the TDI injury dataset was combined with the TDI insurance policies dataset.
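One way to combine the two datasets is a left join that keeps every injury record and attaches a policy where one matches. The report does not state the join key; the Python sketch below assumes the operation name is used, compared case-insensitively, and all rows and the carrier name in it are hypothetical.

```python
def left_join(injuries, policies, key):
    """Attach the first matching policy to each injury record (left join)."""
    lookup = {}
    for policy in policies:
        # Keep only the first policy seen for each normalized key.
        lookup.setdefault(policy[key].strip().lower(), policy)
    joined = []
    for injury in injuries:
        match = lookup.get(injury[key].strip().lower(), {})
        # Injury fields take precedence over policy fields on a name clash.
        joined.append({**match, **injury})
    return joined


# Hypothetical rows for illustration only.
injuries = [
    {"name_of_operation": "Park A", "age": 12},
    {"name_of_operation": "park b ", "age": 30},
]
policies = [{"name_of_operation": "PARK A", "carrier": "Carrier X"}]

joined = left_join(injuries, policies, "name_of_operation")
for row in joined:
    print(row)
```

Injury records without a matching policy are kept and simply lack the policy fields.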

Much of the data wrangling and transformation was done with the dplyr (Wickham, François, Henry, & Müller, 2020) and lubridate (Grolemund & Wickham, 2011) packages.

3 Analysis and findings

3.1 Do age and gender play a major role in amusement park accidents?

The primary question in this report concerns the patterns in the age and gender distribution of amusement park accidents. The answer may help parks to identify high-risk groups and remind those groups to take care of themselves. The Saferparks dataset cannot be used here because it does not contain detailed individual records. Instead, we use the injury dataset reported by TDI.

In Figure 3.1, the overall shape of the distribution does not differ markedly between males and females. There are very few cases around age 20, which could indicate a data quality issue. Unfortunately, investigating this is out of our scope and cannot be done with the data on hand.

Figure 3.1: A faceted bar plot for age distribution in amusement park accident records. Missing values are removed. The x axis is the age, the y axis is the percentage. The empirical distribution is multimodal. There are few cases around age 20.

Table 3.1: Top 10 most frequent age groups in amusement park accident records in Texas. A high percentage of the injured are teenagers.
Age      Percent
(10,15]  15.96
(5,10]   12.32
(15,20]  10.51
(30,35]  9.89
(25,30]  7.47
(35,40]  6.86
(0,5]    5.24
(40,45]  4.64
(45,50]  4.44
(20,25]  3.63

Table 3.2: Top 10 most frequent groups by gender and age in amusement park accident records in Texas. Females account for a larger share of recorded injuries than males across these groups.
Gender  Age      Percent
F       (10,15]  8.68
M       (10,15]  7.28
M       (5,10]   6.87
F       (15,20]  5.86
F       (30,35]  5.85
F       (5,10]   5.45
F       (35,40]  4.65
M       (15,20]  4.65
M       (30,35]  4.04
M       (25,30]  3.83

Table 3.1 shows the ranking of injured age groups. Babies account for around 5% of the injured in the dataset. Excluding babies, 16.53% of the injured are under 16. We distinguish babies from other age groups because they are carried by their parents and cannot move around on their own.

If we take gender into account, as in Table 3.2, which shows the top 10 groups by gender and age, we find that more girls than boys were injured in the 10 to 15 age group in amusement parks in Texas, whereas boys outnumber girls in the 5 to 10 age group.
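The grouping behind Tables 3.1 and 3.2 can be sketched as follows: ages are placed into right-closed five-year intervals such as (10,15], and each gender and interval pair is counted as a percentage of all records. This is a Python illustration, not the report's R code; the sample records are hypothetical, and placing age 0 in the first interval is an assumption.

```python
from collections import Counter


def age_bin(age: int, width: int = 5) -> str:
    """Return the right-closed interval label, e.g. 12 -> "(10,15]"."""
    # Round up to the next multiple of `width`; age 0 is assumed to
    # belong to the first interval.
    upper = max(width, (age + width - 1) // width * width)
    return f"({upper - width},{upper}]"


def percent_by_group(records):
    """Percentage of records in each (gender, age interval) group."""
    counts = Counter((r["gender"], age_bin(r["age"])) for r in records)
    total = sum(counts.values())
    return {group: round(100 * n / total, 2) for group, n in counts.most_common()}


# Hypothetical records for illustration only.
records = [
    {"gender": "F", "age": 12},
    {"gender": "F", "age": 14},
    {"gender": "M", "age": 11},
    {"gender": "M", "age": 7},
]
shares = percent_by_group(records)
print(shares)
```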

In summary, the age range of the injured is broad (from 0 to 71), but most of the injured are young people, especially children (under 18). The fact that around 5 percent of the injured are babies indicates that parents must take care of their babies in amusement parks and make their safety the first priority.

3.2 What is the most dangerous equipment in parks?

We count the total number of injuries by device type. Table 3.3 shows the top 10 device types by number of injuries. It is not surprising that slides and roller coasters are the devices that cause the most injuries. Gutierrez (2016) reported that seven of eight high-profile U.S. amusement park deaths before 2016 were related to roller coasters.

Table 3.3: Top 10 most dangerous device types. Slides and coasters are the major contributors to injuries.
Device Type         Total Number of Injuries
Slide               1622
Coaster             1218
Trampoline court    678
Go-kart             648
Aquatic play area   337
Track ride          311
Flume ride          183
Train/tram          171
Inflatable bouncer  160
Boat ride           159
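A ranking like Table 3.3 can be produced by summing the injured count per device type and sorting. The Python sketch below is an illustration, not the report's R code; it assumes the totals are sums of num_injured and that a missing num_injured counts as zero, and the sample records are hypothetical.

```python
from collections import defaultdict


def injuries_by_device(records, top_n=10):
    """Sum num_injured per device_type and return the top_n, largest first."""
    totals = defaultdict(int)
    for record in records:
        # Assumption: a missing num_injured contributes nothing.
        totals[record["device_type"]] += record["num_injured"] or 0
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical records for illustration only.
records = [
    {"device_type": "Slide", "num_injured": 1},
    {"device_type": "Coaster", "num_injured": 2},
    {"device_type": "Slide", "num_injured": 3},
    {"device_type": "Go-kart", "num_injured": None},
]
ranking = injuries_by_device(records)
print(ranking)
```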

Amusement parks should pay particular attention to roller coasters because, unlike other devices, a roller coaster failure can have severe consequences: riders cannot leave once the ride has launched. First, parks should maintain roller coasters regularly to reduce mechanical failures. Second, they should train staff and require them to work through a safety checklist before every launch. Third, they should ask visitors to follow the safety guide.

3.3 Do amusement ride injuries have seasonal characteristics?

This section examines whether the number of amusement ride injuries follows a seasonal pattern. Table 3.4 shows that ride-related injuries have a seasonal trend that is consistent across the years. The number of injuries in autumn and winter is relatively low. It starts to rise in spring and peaks in summer.

Table 3.4: Amusement ride related injuries across years and seasons. A clear seasonal trend can be observed.
Season  2013  2014  2015  2016  2017
Autumn  11    6     9     7     6
Spring  30    36    20    23    17
Summer  81    55    89    64    45
Winter  2     5     4     1     2
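To make the size of the seasonal effect concrete, the counts in Table 3.4 can be summed per season and expressed as shares of all tabulated injuries. The short Python sketch below uses only the numbers from the table; the variable names are illustrative.

```python
# Counts copied from Table 3.4 (rows: season, columns: 2013 to 2017).
table_3_4 = {
    "Autumn": [11, 6, 9, 7, 6],
    "Spring": [30, 36, 20, 23, 17],
    "Summer": [81, 55, 89, 64, 45],
    "Winter": [2, 5, 4, 1, 2],
}

season_totals = {season: sum(counts) for season, counts in table_3_4.items()}
grand_total = sum(season_totals.values())

# Share of all tabulated injuries that fall in each season, in percent.
season_share = {
    season: round(100 * total / grand_total, 1)
    for season, total in season_totals.items()
}

for season, share in season_share.items():
    print(f"{season}: {season_totals[season]} injuries ({share}%)")
```

Summer accounts for 334 of the 513 tabulated injuries, roughly 65%, and spring for about a quarter.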

Figure 3.2 shows how the number of amusement ride injuries is distributed across months and years. The highest number of injuries in a single month occurred in June 2015, with 41 injuries. Another notable feature is that the numbers of ride-related injuries in 2014 and 2017 are relatively low compared with other years.

It may be beneficial for ride owners to focus their resources on spring and summer, when the number of injuries is at its height. More regular and rigorous inspections of ride equipment can be performed during these periods. Ride owners may also provide staff with additional training in the period leading up to summer. Further study may also help to answer the following questions, which unfortunately are out of the scope of this report:

  • Why did 2014 and 2017 have comparatively low number of injuries?
  • Why did June 2015 have such a high number of injuries?

Figure 3.2: A grouped line chart of ride-related injuries showing seasonal trends. Lines are grouped by year. The x axis is the month and the y axis is the total number of injuries. In 2015, an unusually large number of cases was reported in June.

3.6 What amusement parks have most injuries in Texas?

Since the injury reporting mechanism varies from state to state, we only analysed the injuries in each park in Texas (assuming that the reporting mechanism is the same across parks in Texas). We want to analyse not only the injuries that happened each year in each park but also the total injuries across all years in each park, so we kept the injury records with missing injury_year values.

Figure 3.7 shows the 10 amusement parks with the most injuries in Texas. The Six Flags Over Texas park has the most injuries, especially in 2013 and 2014. The Sky group Investments LLC DBA iFly Houston Memorial park had a notably large number of injuries in 2015, while the Typhoon Texas - Austin Park had many injuries in 2016.

Figure 3.7: A segmented bar plot of amusement parks in Texas ranked by total number of injuries. The bars are filled by year. The x axis is the number of injuries and the y axis is the park name. The top 2 parks with the most injuries are operated by Six Flags.

3.7 What manufacturer’s product caused the most amusement park injuries?

In this part, we identify the manufacturers whose products were involved in the most amusement park injuries over the years. This could remind amusement parks that have devices from these manufacturers to maintain them carefully. Figure 3.8 shows the top 10 such manufacturers and their percentage of the total injuries recorded in the Saferparks dataset.

The dataset records 254 manufacturers, whose devices account for 8793 injuries in total. The top 10 manufacturers account for 2841 of these, or 32.3% of all amusement park injuries. Notably, in-house manufactured devices account for more than 10% of total injuries.
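The 32.3% figure follows directly from the two totals quoted above. The Python sketch below reproduces that arithmetic and wraps it in a small helper; the helper name and the example counts passed to it are hypothetical.

```python
def top_n_share(counts, n=10):
    """Return the percentage of the total contributed by the n largest counts."""
    ordered = sorted(counts, reverse=True)
    return round(100 * sum(ordered[:n]) / sum(ordered), 1)


# Totals quoted in the text: top 10 manufacturers vs. all manufacturers.
top_10_injuries = 2841
all_injuries = 8793
reported_share = round(100 * top_10_injuries / all_injuries, 1)
print(reported_share)  # 32.3

# Hypothetical per-manufacturer counts, only to show the helper's use.
example_counts = [50, 30, 10, 5, 5]
print(top_n_share(example_counts, n=2))  # 80.0
```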

Figure 3.8: A bar plot of the top 10 manufacturers with the most injuries. The x axis is the number of injuries and the y axis is the name of the manufacturer. A large proportion of injuries are related to in-house devices.

4 Conclusions

In this report, we analysed amusement park injury data from the U.S. Because no data on the total number of visitors is available, it is hard to say whether parks with fewer injuries are safer than others. Keeping this in mind, we report several major findings.

  1. About half of the injured visitors are younger than 35 years old, and the age group with the most injuries is children between 10 and 15 years old. Slightly more girls than boys are injured in this age group.

  2. Slides and coasters caused more injuries than other kinds of rides.

  3. More visitors are injured in amusement parks in summer, while the number of injuries in winter is much smaller. This may be because there are more visitors in summer.

  4. Of all the injuries, most are related to devices insured by the Hartford Fire Insurance Company. However, a high number of injuries does not necessarily mean that an insurance company is negligent. It may be that the parks covered by the insurance company attract many more guests than other parks.

  5. In a high proportion of injury cases, the injured hurt their upper limbs, lower limbs or neck. Girls are more likely to have lower limb injuries.

  6. iFly, operated by Skygroup investment LLC, and Hurricane Harbor, operated by Six Flags, account for the unusually high numbers of limb injuries in 2015 and 2017. There are potential safety issues with their devices.

  7. In Texas, the Six Flags Over Texas park has the most amusement park injuries, especially in 2013 and 2014. The Skygroup Investments LLC DBA iFly Houston Memorial park had a notably large number of injuries in 2015, while the Typhoon Texas Austin Park had many injuries in 2016.

  8. In-house manufactured amusement park devices account for more than 10% of total injuries. This is a safety hazard that cannot be ignored. The top 10 manufacturers by number of injuries account for 32.3% of all injuries, out of 254 manufacturers recorded in the dataset.

5 Acknowledgments

The following packages are used to produce this report: visdat (Tierney, 2017), dplyr (Wickham et al., 2020), readr (Wickham, Hester, & Francois, 2018), tidyverse (Wickham et al., 2019), lubridate (Grolemund & Wickham, 2011), knitr (Xie, 2014), kableExtra (Zhu, 2019), tidytext (Silge & Robinson, 2016), wordcloud2 (Lang & Chien, 2018), janitor (Firke, 2020), here (Müller, 2017), plotly (Sievert, 2020), rlist (Ren, 2016).

References

Amusement Parks, I. A. of, & Attractions. (2018). In IAAPA RIDE SAFETY REPORT – NORTH AMERICA – 2018.

Firke, S. (2020). Janitor: Simple tools for examining and cleaning dirty data. Retrieved from https://CRAN.R-project.org/package=janitor

Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. Retrieved from http://www.jstatsoft.org/v40/i03/

Gutierrez, L. (2016). Eight high-profile U.S. amusement park deaths in recent years. The Kansas City Star. Retrieved from https://www.kansascity.com/news/nation-world/national/article94407457.html

IAAPA. (2017). Global theme and amusement park outlook 2017–2021.

Index, T., & Index/AECOM, M. (2019). In TEA/AECOM 2019 Theme Index and Museum Index: The Global Attractions Attendance Report.

Insurance, T. D. of. (2019). Amusement ride FAQs. Retrieved from https://www.tdi.texas.gov/commercial/lcamuseinfo.html#reports

Lang, D., & Chien, G.-t. (2018). Wordcloud2: Create word cloud by ’htmlwidget’. Retrieved from https://CRAN.R-project.org/package=wordcloud2

Millerbernd, A. (2018). Texas amusement park accidents. Retrieved from https://data.world/amillerbernd/texas-amusement-park-accidents

Müller, K. (2017). Here: A simpler way to find your files. Retrieved from https://CRAN.R-project.org/package=here

Ren, K. (2016). Rlist: A toolbox for non-tabular data manipulation. Retrieved from https://CRAN.R-project.org/package=rlist

Saferparks. (2020). Accident reports from state/federal regulators. Retrieved from https://ridesdatabase.org/saferparks/data/

Saferparks. (n.d.). In Saferparks Accident Data (pp. 2–3). Retrieved from https://ridesdatabase.org/wp-content/uploads/2020/02/Saferparks-data-description.pdf

Schneider, M. (2019). Theme park attendance crosses half-billion mark for 1st time. Retrieved from https://www.usnews.com/news/best-states/florida/articles/2019-05-23/theme-park-attendance-crosses-half-billion-mark-for-1st-time

Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman; Hall/CRC. Retrieved from https://plotly-r.com

Silge, J., & Robinson, D. (2016). Tidytext: Text mining and analysis using tidy data principles in R. JOSS, 1(3). https://doi.org/10.21105/joss.00037

TDI. (2020a). Amusement ride current insurance policies. Retrieved from https://www.tdi.texas.gov/commercial/lcamusepolicy.html

TDI. (2020b). Amusement ride requirements. Retrieved from https://www.tdi.texas.gov/commercial/indexamusement.html

Tidy Tuesday. (2020). A weekly social data project in R. Retrieved from https://github.com/rfordatascience/tidytuesday

Tierney, N. (2017). Visdat: Visualising whole data frames. JOSS, 2(16), 355. https://doi.org/10.21105/joss.00355

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H., François, R., Henry, L., & Müller, K. (2020). Dplyr: A grammar of data manipulation. Retrieved from https://CRAN.R-project.org/package=dplyr

Wickham, H., Hester, J., & Francois, R. (2018). Readr: Read rectangular text data. Retrieved from https://CRAN.R-project.org/package=readr

Woodcock, K. (2014). Amusement ride injury data in the United States. Safety Science, 62, 466–474.

Xie, Y. (2014). Knitr: A comprehensive tool for reproducible research in R. In V. Stodden, F. Leisch, & R. D. Peng (Eds.), Implementing reproducible computational research. Chapman; Hall/CRC. Retrieved from http://www.crcpress.com/product/isbn/9781466561595

Zhu, H. (2019). KableExtra: Construct complex table with ’kable’ and pipe syntax. Retrieved from https://CRAN.R-project.org/package=kableExtra